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Abstract 

We propose a general approach to compute the seed sensitivity, that can be applied to different 
definitions of seeds. It treats separately three components of the seed sensitivity problem - a set of 
target alignments, an associated probability distribution, and a seed model - that are specified by distinct 
finite automata. The approach is then applied to a new concept of subset seeds for which we propose 
an efficient automaton construction. Experimental results confirm that sensitive subset seeds can be 
efficiently designed using our approach, and can then be used in similarity search producing better results 
than ordinary spaced seeds. 

1 Introduction 

In the framework of pattern matching and similarity search in biological sequences, seeds specify a class 
of short sequence motif which, if shared by two sequences, are assumed to witness a potential similarity. 
Spaced seeds have been introduced several years ago |8] ED and have been shown to improve significantly 
the efficiency of the search. One of the key problems associated with spaced seeds is a precise estimation 
of the sensitivity of the associated search method. This is important for comparing seeds and for choosing 
most appropriate seeds for a sequence comparison problem to solve. 

The problem of seed sensitivity depends on several components. First, it depends on the seed model 
specifying the class of allowed seeds and the way that seeds match (hit) potential alignments. In the basic 
case, seeds are specified by binary words of certain length (span), possibly with a constraint on the number 
of l's (weight). However, different extensions of this basic seed model have been proposed in the literature, 
such as multi-seed (or multi-hit) strategies [HI ED, seed families El |20| El GH E3 16), seeds over 
non-binary alphabets {9l [l9ll . vector seeds H|6]|. 

The second parameter is the class of target alignments that are alignment fragments that one aims to 
detect. Usually, these are gapless alignments of a given length. Gapless alignments are easy to model, in the 
simplest case they are represented by binary sequences in the match/mismatch alphabet. This representation 
has been adopted by many authors I T8l IT3l f5l ITUl l7l fTTI . The binary representation, however, cannot distin- 
guish between different types of matches and mismatches, and is clearly insufficient in the case of protein 
sequences. In H]|6), an alignment is represented by a sequence of real numbers that are scores of matches 
or mismatches at corresponding positions. A related, but yet different approach is suggested in [19], where 
DNA alignments are represented by sequences on the ternary alphabet of match/transition/transversion. 
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Finally, another generalization of simple binary sequences was considered in [15], where alignments are 
required to be homogeneous, i.e. to contain no sub-alignment with a score larger than the entire alignment. 

The third necessary ingredient for seed sensitivity estimation is the probability distribution on the set of 
target alignments. Again, in the simplest case, alignment sequences are assumed to obey a Bernoulli model 
ITSl [TO I. In more general settings, Markov or Hidden Markov models are considered QH). A different 
way of defining probabilities on binary alignments has been taken in fi31 : all homogeneous alignments of a 
given length are considered equiprobable. 

Several algorithms for computing the seed sensitivity for different frameworks have been proposed in 
the above-mentioned papers. All of them, however, use a common dynamic programming (DP) approach, 
first brought up in fT3ll . 

In the present paper, we propose a general approach to computing the seed sensitivity. This approach 
subsumes the cases considered in the above-mentioned papers, and allows to deal with new combinations 
of the three seed sensitivity parameters. The underlying idea of our approach is to specify each of the three 
components - the seed, the set of target alignments, and the probability distribution - by a separate finite 
automaton. 

A deterministic finite automaton (DFA) that recognizes all alignments matched by given seeds was 
already used in [7| for the case of ordinary spaced seeds. In this paper, we assume that the set of target 
alignments is also specified by a DFA and, more importantly, that the probabilistic model is specified by a 
probability transducer - a probability-generating finite automaton equivalent to HMM with respect to the 
class of generated probability distributions. 

We show that once these three automata are set, the seed sensitivity can be computed by a unique gen- 
eral algorithm. This algorithm reduces the problem to a computation of the total weight over all paths in an 
acyclic graph corresponding to the automaton resulting from the product of the three automata. This com- 
putation can be done by a well-known dynamic programming algorithm ll2"Tl \¥2 | with the time complexity 
proportional to the number of transitions of the resulting automaton. Interestingly, all above-mentioned 
seed sensitivity algorithms considered by different authors can be reformulated as instances of this general 
algorithm. 

In the second part of this work, we study a new concept of subset seeds - an extension of spaced seeds 
that allows to deal with a non-binary alignment alphabet and, on the other hand, still allows an efficient 
hashing method to locate seeds. For this definition of seeds, we define a DFA with a number of states 
independent of the size of the alignment alphabet. Reduced to the case of ordinary spaced seeds, this DFA 
construction gives the same worst-case number of states as the Aho-Corasick DFA used in 0. Moreover, 
our DFA has always no more states than the DFA of [7], and has substantially less states on average. 

Together with the general approach proposed in the first part, our DFA gives an efficient algorithm for 
computing the sensitivity of subset seeds, for different classes of target alignments and different probability 
transducers. In the experimental part of this work, we confirm this by running an implementation of our 
algorithm in order to design efficient subset seeds for different probabilistic models, trained on real genomic 
data. We also show experimentally that designed subset seeds allow to find more significant alignments than 
ordinary spaced seeds of equivalent selectivity. 

2 General Framework 

Estimating the seed sensitivity amounts to compute the probability for a random word (target alignment), 
drawn according to a given probabilistic model, to belong to a given language, namely the language of all 
alignments matched by a given seed (or a set of seeds). 



2 



2.1 Target Alignments 

Target alignments are represented by words over an alignment alphabet A. In the simplest case, consid- 
ered most often, the alphabet is binary and expresses a match or a mismatch occurring at each align- 
ment column. However, it could be useful to consider larger alphabets, such as the ternary alphabet of 
match/transition/transversion for the case of DNA (see [ 19 1). The importance of this extension is even more 
evident for the protein case ([6]), where different types of amino acid pairs are generally distinguished. 

Usually, the set of target alignments is a finite set. In the case considered most often fTHl [T3l |5] EH 
^D. target alignments are all words of a given length n. This set is trivially a regular language that 
can be specified by a deterministic automaton with (n + 1) states. However, more complex definitions of 
target alignments have been considered (see e.g. U3l) that aim to capture more adequately properties of 
biologically relevant alignments. In general, we assume that the set of target alignments is a finite regular 
language Lt € A* and thus can be represented by an acyclic DFA T =< Qt, q T , q^, A, tpr >■ 

2.2 Probability Assignment 

Once an alignment language Lt has been set, we have to define a probability distribution on the words of 
Lt- We do this using probability transducers. 

A probability transducer is a finite automaton without final states in which each transition outputs a 
probability. 

Definition 1. A probability transducer G over an alphabet A is a 4-tuple < Qg, q G ,A,pc >, where Qg is 
a finite set of states, q G 6 Qg is an initial state, and pg : Qg x Ax Qg —> [0, 1] is a real- valued probability 
function such that 

V? G Qg, J2 q 'eQ G ,aeA Pg{q, a, q') = 1. 

A transition of G is a triplet e =< q, a, q' > such that p(q, a, q') > 0. Letter a is called the label of e 
and denoted label(e). A probability transducer G is deterministic if for each q € Qg and each a € A, there 
is at most one transition < q,a,q' >. For each path P = (e±, e n ) in G, we define its label to be the word 
label (P) = label (ei)... label (e n ), and the associated probability to be the product p(P) = YYi=i Pc(si)- A 
path is initial, if its start state is the initial state q G of the transducer G. 

Definition 2. The probability of a word w £ A* according to a probability transducer G =< Qg, q%,A,pc >, 
denoted Vg{w), is the sum of probabilities of all initial paths in G with the label w. Vg{w) = if no such 
path exists. The probability Vg{L) of a finite language L C A* according a probability transducer G is 
defined by Vg{L) = Y^ w &l'Pg{w)- 

Note that for any n and for L = A n (all words of length n), Vg{L) = 1. 

Probability transducers can express common probability distributions on words (alignments). Bernoulli 
sequences with independent probabilities of each symbol ffSl ITOl [Till can be specified with deterministic 
one-state probability transducers. In Markov sequences of order k [7, 20 1, the probability of each symbol 
depends on k previous symbols. They can therefore be specified by a deterministic probability transducer 
with at most \A\ k states. 

A Hidden Markov model (HMM) [ 5 ] corresponds, in general, to a non-deterministic probability trans- 
ducer. The states of this transducer correspond to the (hidden) states of the HMM, plus possibly an ad- 
ditional initial state. Inversely, for each probability transducer, one can construct an HMM generating the 
same probability distribution on words. Therefore, non-deterministic probability transducers and HMMs 
are equivalent with respect to the class of generated probability distributions. The proofs are straightforward 
and are omitted due to space limitations. 
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2.3 Seed automata and seed sensitivity 

Since the advent of spaced seeds JHJED, different extensions of this idea have been proposed in the literature 
(see Introduction). For all of them, the set of possible alignment fragments matched by a seed (or by a set 
of seeds) is a finite set, and therefore the set of matched alignments is a regular language. For the original 
spaced seed model, this observation was used by Buhler et al. Q who proposed an algorithm for computing 
the seed sensitivity based on a DFA defining the language of alignments matched by the seed. In this paper, 
we extend this approach to a general one that allows a uniform computation of seed sensitivity for a wide 
class of settings including different probability distributions on target alignments, as well as different seed 
definitions. 

Consider a seed (or a set of seeds) it under a given seed model. We assume that the set of alignments L v 
matched by it is a regular language recognized by a DFA S w =< Qs, q% Q$ > A, V'S >■ Consider a finite 
set Lt of target alignments and a probability transducer G. Under this assumptions, the sensitivity of tt is 
defined as the conditional probability 

V G {L T n L„) 
Vg(Lt) 

An automaton recognizing L = Lt D L w can be obtained as the product of automata T and S n recog- 
nizing Lt and L n respectively. Let K =< Qk, q K , Q^, A, ipK > be this automaton. We now consider the 
product W of K and G, denoted K x G, defined as follows. 

Definition 3. GivenaDFAiT =< QK,q%,Q^,A,t()K > and a probability transducer G =< QG-,q%-,A, pG >, 
the product of K and G is the probability-weighted automaton W =< Qw, <?{y, Qwi A Pw > (f° r short, 
PVf -automaton) such that 

• Qw = Qk x Qg, 

• & = 

• Qw = {(QK,qG)\qK e Q K }, 



Pw((qK,qG),a, Wk^g)) 



Pc(qG,a,q' G ) if i/j K (qK,a) = q' K , 
otherwise. 



W can be viewed as a non-deterministic probability transducer with final states. pw{{qK, qc), a> {q'ki i'g)) 
is the probability of the transition < (qK,qc), a, (q' K , q' G ) >. A path in W is called full if it goes from the 
initial to a final state. 

Lemma 4. Let G be a probability transducer. Let L be a finite language and K be a deterministic automaton 
recognizing L. Let W = G x K. The probability Vg(L) is equal to sum of probabilities of all full paths in 
W. 

Proof. Since K is a deterministic automaton, each word w G L corresponds to a single accepting path in 
K and the paths in G labeled w (see Definition are in one-to-one correspondence with the full path in W 
accepting w. By definition, Vg(w) is equal to the sum of probabilities of all paths in G labeled w. Each 
such path corresponds to a unique path in W, with the same probability. Therefore, the probability of w is 
the sum of probabilities of corresponding paths in W. Each such path is a full path, and paths for distinct 
words w are disjoint. The lemma follows. □ 
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2.4 Computing Seed Sensitivity 

Lemma |3] reduces the computation of seed sensitivity to a computation of the sum of probabilities of paths 
in a PW-automaton. 

Lemma 5. Consider an alignment alphabet A, a finite set Lj> Q A* of target alignments, and a set L w C 
A* of all alignments matched by a given seed tt. Let K =< Qk, Qt: Qk> V'Q > ^ e an acyclic DFA 
recognizing the language L = Lj> n L n . Let further G =< Qg,Qq,A,p > be a probability transducer 
defining a probability distribution on the set Lt- Then Vq(L) can be computed in time 

0(\Qg\ 2 -\Qk\-\A\) (2) 

and space 

0{\Q G \-\Q K \). (3) 

Proof. By Lemma|4] the probability of L with respect to G can be computed as the sum of probabilities of 
all full paths in W. Since K is an acyclic automaton, so is W. Therefore, the sum of probabilities of all full 
paths in W leading to final states can be computed by a classical DP algorithm [21 ] applied to acyclic 
directed graphs ([ 12 1 presents a survey of application of this technique to different bioinformatic problems). 
The time complexity of the algorithm is proportional to the number of transitions in W. W has \Qg\ • \Qi< \ 
states, and for each letter of A, each state has at most \Qg\ outgoing transitions. The bounds follow. □ 

Lemma |5] provides a general approach to compute the seed sensitivity. To apply the approach, one has 
to define three automata: 

• a deterministic acyclic DFA T specifying a set of target alignments over an alphabet A (e.g. all words 
of a given length, possibly verifying some additional properties), 

• a (generally non-deterministic) probability transducer G specifying a probability distribution on target 
alignments (e.g. Bernoulli model, Markov sequence of order k, HMM), 

• a deterministic DFA S w specifying the seed model via a set of matched alignments. 

As soon as these three automata are defined, Lemma[5]can be used to compute probabilities Vg(Lt H L n ) 
and Vg(Lt) hi order to estimate the seed sensitivity according to Q. 

Note that if the probability transducer G is deterministic (as it is the case for Bernoulli models or Markov 
sequences), then the time complexity $2% is 0(\Qg\ • \Qk\ ■ \ In general, the complexity of the algorithm 
can be improved by reducing the involved automata. Buhler et al. Q introduced the idea of using the 
Aho-Corasick automaton 1 1 1 as the seed automaton S n for a spaced seed. The authors of [7 1 considered all 
binary alignments of a fixed length n distributed according to a Markov model of order k. In this setting, 
the obtained complexity was 0(w2 s - w 2 k n), where s and w are seed's span and weight respectively. Given 
that the size of the Aho-Corasick automaton is 0(w2 s ~ w ), this complexity is automatically implied by 
Lemma |5j as the size of the probability transducer is 0(2 k ), and that of the target alignment automaton is 
0(n). Compared to Q, our approach explicitly distinguishes the descriptions of matched alignments and 
their probabilities, which allows us to automatically extend the algorithm to more general cases. 

Note that the idea of using the Aho-Corasick automaton can be applied to more general seed models than 
individual spaced seeds (e.g. to multiple spaced seeds, as pointed out in 171). In fact, all currently proposed 
seed models can be described by a finite set of matched alignment fragments, for which the Aho-Corasick 
automaton can be constructed. We will use this remark in later sections. 
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The sensitivity of a spaced seed with respect to an HMM-specified probability distribution over binary 
target alignments of a given length n was studied by Brejova et al. Q. The DP algorithm of Q has a 
lot in common with the algorithm implied by Lemma |5] In particular, the states of the algorithm of 0] 
are triples < w,q,m >, where w is a prefix of the seed n, q is a state of the HMM, and m G [0..n]. 
The states therefore correspond to the construction implied by Lemma |5] However, the authors of 0] do 
not consider any automata, which does not allow to optimize the preprocessing step (counterpart of the 
automaton construction) and, on the other hand, does not allow to extend the algorithm to more general seed 
models and/or different sets of target alignments. 

A key to an efficient solution of the sensitivity problem remains the definition of the seed. It should be 
expressive enough to be able to take into account properties of biological sequences. On the other hand, it 
should be simple enough to be able to locate seeds fast and to get an efficient algorithm for computing seed 
sensitivity. According to the approach presented in this section, the latter is directly related to the size of a 
DFA specifying the seed. 

3 Subset seeds 
3.1 Definition 

Ordinary spaced seeds use the simplest possible binary "match-mismatch" alignment model that allows an 
efficient implementation by hashing all occurring combinations of matching positions. A powerful gener- 
alization of spaced seeds, called vector seeds, has been introduced in |4j. Vector seeds allow one to use an 
arbitrary alignment alphabet and, on the other hand, provide a flexible definition of a hit based on a coopera- 
tive contribution of seed positions. A much higher expressiveness of vector seeds lead to more complicated 
algorithms and, in particular, prevents the application of direct hashing methods at the seed location stage. 

In this section, we consider subset seeds that have an intermediate expressiveness between spaced and 
vector seeds. It allows an arbitrary alignment alphabet and, on the other hand, still allows using a direct 
hashing for locating seed, which maps each string to a unique entry of the hash table. We also propose a 
construction of a seed automaton for subset seeds, different from the Aho-Corasick automaton. The automa- 
ton has 0(w2 s ~ w ) states regardless of the size of the alignment alphabet, where s and w are respectively 
the span of the seed and the number of "must-match" positions. From the general algorithmic framework 
presented in the previous section (Lemma this implies that the seed sensitivity can be computed for 
subset seeds with same complexity as for ordinary spaced seeds. Note also that for the binary alignment 
alphabet, this bound is the same as the one implied by the Aho-Corasick automaton. However, for larger 
alphabets, the Aho-Corasick construction leads to 0(w\A\ s ~ w ) states. In the experimental part of this paper 
(section I4.lt we will show that even for the binary alphabet, our automaton construction yields a smaller 
number of states in practice. 

Consider an alignment alphabet A. We always assume that A contains a symbol 1, interpreted as 
"match". A subset seed is defined as a word over a seed alphabet B, such that 

• letters of B denote subsets of the alignment alphabet A containing l(BC{l}u 2" 4 ), 

• B contains a letter # that denotes subset {l}, 

• a subset seed • • • b m G B m matches an alignment fragment a±a2 ■ ■ ■ a m G A m if Vi G [l..m], 

di G b { . 

The ^-weight of a subset seed 7r is the number of # in ir and the span of it is its length. 
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Example 1. [19] considered the alignment alphabet A = {l,h, 0} representing respectively a match, a 
transition mismatch, or a transversion mismatch in a DNA sequence alignment. The seed alphabet is 
B = {#,©,_} denoting respectively subsets {l}, {l,h}, and {l,h, 0}. Thus, seed tt = matches 
alignment s = lOhlhllOl at positions 4 and 6. The span of tt is 4, and the ^-weight of tt is 2. 

Note that unlike the weight of ordinary spaced seeds, the ^-weight cannot serve as a measure of seed 
selectivity. In the above example, symbol @ should be assigned weight 0.5, so that the weight of tt is equal 
to 2.5 (see 030). 



3.2 Subset Seed Automaton 

Let us fix an alignment alphabet A, a seed alphabet B, and a seed tt = 7Ti7T2 . . . vr m G B* of span m and 
^-weight w. Let _R„. be the set of all non-# positions in tt, \R^\ = r = m — w. We now define an automaton 
S w =< Q,qo,Qj,A, ip : Q x A — > Q > that recognizes the set of all alignments matched by tt. 

The states Q of S n are pairs < X, t > such that X C i?^, t G [0, . . . , m), with the following invariant 
condition. Suppose that S w has read a prefix si . . . s p of an alignment s and has come to a state < X, t >. 
Then £ is the length of the longest suffix of si . . . s p of the form 1*, % < m, and X contains all positions 
Xi G i^Tr such that prefix tti • ■ ■ vr Xi of -k matches a suffix of s\ • ■ ■ s p _j. Sg f 

HlhlOllhTl . . . 

^ ^=#@#_##_### ^ 7T1.. 7 =#@#_ ##_ 

( b ) 7r 1..4=#@#_ 

w s = lllhlOllhll . . . vr L .2=#@ 

Figure 1: Illustration to Example |2] 



Example 2. In the framework of Example^ consider a seed 7r and an alignment prefix s of length p = 11 
given on Figure ^a) and (b) respectively. The length t of the last run of l's of s is 2. The last mismatch 
position of s is sg = h. The set R n of non-# positions of 7r is {2, 4, 7} and it has 3 prefixes ending at 
positions of R n (Figure[nc)). Prefixes tt\..i and 7T1..7 do match suffixes of s\S2 ■ • • sg, and prefix tt\.a does 
not. Thus, the state of the automaton after reading s\S2 . . . sn is < {2, 7}, 2 >. 



The initial state go of SV is the state < 0, >. The final states Qf of S n are all states q =< X,t >, 
where max{X} + t = m. All final states are merged into one state. 

The transition function ip(q, a) is defined as follows: If q is a final state, then Va G A, ip(q, a) = q. If 
q =< X, t > is a non-final state, then 

• if a = 1 then ^(q, a) =< X, t + 1 >, 

• otherwise ^(g, a) =< Xj/ U Xy, > with 

- X[/ = {x|x < t + 1 and a matches tt x } 

- Xy = {x + t + l\x £ X and a matches ir x +t+i} 

Lemma 6. The automaton SV accepts the set of all alignments matched by it. 

Proof. It can be verified by induction that the invariant condition on the states < X, t > G Q is preserved by 
the transition function tp. The final states verify max{X} + t = m, which implies that tt matches a suffix 
of S\ . . . s„. □ 
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Lemma 7. The number of states of the automaton S n is no more than (w + l)2 r . 

Proof. Assume that R n = {x\,X2, ■ ■ ■ ,x r } and x\ < x?, • ■ ■ < x r . Let Qi be the set of non-final states 
< X, t > with max{X} = X{, i £ [L-t]. For states q =< X, t >G Qi there are 2*~ 1 possible values of X 
and m — Xi possible values of t, as max{X} + 1 < m — 1. 
Thus, 

|Qi| < 2 i - 1 (m-a; i ) < 2^ 1 (m-i), and (4) 

r r 

^2\Qi\ < '^22 i - 1 (m-i) = (m - r + l)2 r - m - 1. (5) 

i=l i=l 

Besides states Qi, Q contains m states < 0,t > (t G [0..m — 1]) and one final state. Thus, |Q| < 
(m - r + l)2 r = (w + l)2 r . □ 

Note that if it starts with which is always the case for ordinary spaced seeds, then Xi > i + 1, 
i € [L.r], and the bound of © rewrites to 2 i_1 (m — i — 1). This results in the same number of states w2 r 
as for the Aho-Corasick automaton The construction of automaton S w is optimal, in the sense that no 
two states can be merged in general, as the following Lemma states. 

Lemma 8. Consider a spaced seed tt which consists of two "must-match" symbols # separated by r jokers. 
Then the automaton S n is reduced, that is any non-final state is reachable from the initial state qo, and any 
two non-final states q, q' are non-equivalent. 

Proof. See appendix □ 

A straightforward generation of the transition table of the automaton SV can be performed in time 0(r ■ 
w ■ 2 r ■ \ A\). A more complicated algorithm allows one to reduce the bound to 0(w ■ 2 r ■ \ A\). This algorithm 
is described in full details in Appendix 151 Here we summarize it in the following Lemma. 

Lemma 9. The transition table of automaton S n can be constructed in time proportional to its size, which 
isO(w-2 r ■ \A\). 

In the next section, we demonstrate experimentally that on average, our construction yields a very com- 
pact automaton, close to the minimal one. Together with the general approach of section |2j this provides 
a fast algorithm for computing the sensitivity of subset seeds and, in turn, allows to perform an efficient 
design of spaced seeds well-adapted to the similarity search problem under interest. 

4 Experiments 

Several types of experiments have been performed to test the practical applicability of the results of sec- 
tions |2l3] We focused on DNA similarity search, and set the alignment alphabet A to {l,h, 0} (match, 
transition, trans version). For subset seeds, the seed alphabet B was set to @, _}, where # = {l}, @ = 
{1, h}, _ = {1, h, 0} (see Example[0. The weight of a subset seed is computed by assigning weights 1, 0.5 
and to symbols @ and _ respectively. 
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4.1 Size of the automaton 

We compared the size of the automaton S w defined in section[3]and the Aho-Corasick automaton 1 1 1, both for 
ordinary spaced seeds (binary seed alphabet) and for subset seeds. The Aho-Corasick automaton for spaced 
seeds was constructed as defined in [7 ]. For subset seeds, a straightforward generalization was considered: 
the Aho-Corasick construction was applied to the set of alignment fragments matched by the seed. 

Tables f^a) and \\}b) present the results for spaced seeds and subset seeds respectively. For each seed 
weight w, we computed the average number of states (avg. size) of the Aho-Corasick automaton and our 
automaton S n , and reported the corresponding ratio (5) with respect to the average number of states of the 
minimized automaton. The average was computed over all seeds of span up to w + 8 for spaced seeds and 
all seeds of span up to w + 5 with two @'s for subset seeds. Interestingly, our automaton turns out to be more 



Spaced 


Aho-Corasick 






Minimized 


Subset 


Aho-Corasick 


Sir 




Minimized 


w 


avg. size 


8 


avg. size 


8 


avg. size 


w 


avg. size 


8 


avg. size 


8 


avg. size 


9 


345.94 


3.06 


146.28 


1.29 


113.21 


9 


1900.65 


15.97 


167.63 


1.41 


119,00 


10 


380.90 


3.16 


155.11 


1.29 


120.61 


10 


2103.99 


16.50 


177.92 


1.40 


127.49 


11 


415.37 


3.25 


163.81 


1.28 


127.62 


11 


2306.32 


16.96 


188.05 


1.38 


135.95 


12 


449.47 


3.33 


172.38 


1.28 


134.91 


12 


2507.85 


17.42 


198.12 


1.38 


144.00 


13 


483,27 


3.41 


180.89 


1.28 


141.84 


13 


2709.01 


17.78 


208.10 


1.37 


152.29 



(a) (b) 

Table 1 : Comparison of the average number of states of Aho-Corasick automaton, automaton S n of section^ 
and minimized automaton 

compact than the Aho-Corasick automaton not only on non-binary alphabets (which was expected), but also 
on the binary alphabet (cf Table^a)). Note that for a given seed, one can define a surjective mapping from 
the states of the Aho-Corasick automaton onto the states of our automaton. This implies that our automaton 
has always no more states than the Aho-Corasick automaton. 

4.2 Seed Design 

In this part, we considered several probability transducers to design spaced or subset seeds. The target 
alignments included all alignments of length 64 on alphabet {l,h, 0}. Four probability transducers have 
been studied (analogous to those introduced in l3l0: 

• B: Bernoulli model 

• DTI: deterministic probability transducer specifying probabilities of {1, h, 0} at each codon position 
(extension of the model of [3 1 to the three-letter alphabet), 

• DT2: deterministic probability transducer specifying probabilities of each of the 27 codon instances 
{1, h, 0} 3 (extension of the model of to the three-letter alphabet), 

• NT: non-deterministic probability transducer combining four copies of DT2 specifying four distinct 
codon conservation levels (called HMM model in l3lD. 

Models DTI, DT2 and NT have been trained on alignments resulting from a pairwise comparison of 40 
bacteria genomes. Details of the training procedure as well as the resulting parameter values are given in 
Appendix O 

For each of the four probability transducers, we computed the best seed of weight w (w = 9, 10, 11, 12) 
among two categories: ordinary spaced seeds of weight w and subset seeds of weight w with two @. Ordi- 
nary spaced seeds were enumerated exhaustively up to a given span, and for each seed, the sensitivity was 
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computed using the algorithmic approach of section |2] and the seed automaton construction of section |3] 
Each such computation took between 10 and 500ms on a Pentium IV 2.4GHz computer depending on the 
seed weight/span and the model used. In each experiment, the most sensitive seed found has been kept. The 
results are presented in Tables l2ll5l 



U 1 


spaced seeds 


Sens. 


subset seeds, two @ 


Sens. 


o 

V 


###___#_#_##_## 


0.4183 


###_#__#@#_@## 


0.4443 


1 n 


##_##___##_#_### 


0.2876 


###_@#_@#_#_### 


0.3077 


11 


###_###_#__#_### 


0.1906 


##@#__##_#_#_@### 


0.2056 


12 


###_#_##_#__##_### 


0.1375 


##@#_#_##__#@_#### 


0.1481 




Table 2: Best seeds and their sensitivity for probability transducer B 


w 


spaced seeds 


Sens. 


subset seeds, two @ 


Sens. 


9 


###___##_##_## 


0.4350 


##@___##_##_##@ 


0.4456 


10 


##_## ##_##_## 


0.3106 


##_##___@##_##@# 


0.3173 


11 


##_## ##_##_### 


0.2126 


##@#@_##_##__### 


0.2173 


12 


##_## ##_##_#### 


0.1418 


##_@###__##_##@## 


0.1477 


Table 3: Best seeds and their sensitivity for probability transducer DTI 


w 


spaced seeds 


Sens. 


subset seeds, two @ 


Sens. 


9 


#_## ##_##_## 


0.5121 


#_#@_##_@ __##_## 


0.5323 


10 


##_##_## ##_## 


0.3847 


##_@#_##__@_##_## 


0.4011 


11 


##_##__#_#___#_##_## 


0.2813 


##_##_@#_#___#_#@_## 


0.2931 


12 


##_##_##_#___#_##_## 


0.1972 


##_##_#@_##_@ __##_## 


0.2047 



Table 4: Best seeds and their sensitivity for probability transducer DT2 



In all cases, subset seeds yield a better sensitivity than ordinary spaced seeds. The sensitivity increment 
varies up to 0.04 which is a notable increase. As shown in [19|, the gain in using subset seeds increases 
substantially when the transition probability is greater than the inversion probability, which is very often the 
case in related genomes. 

4.3 Comparative performance of spaced and subset seeds 

We performed a series of whole genome comparisons in order to compare the performance of designed 
spaced and subset seeds. Eight complete bacterial genomes 1 have been processed against each other using 
the YASS software [ 19 1. Each comparison was done twice: one with a spaced seed and another with a subset 
seed of the same weight. 

The threshold E-value for the output alignments was set to 10, and for each comparison, the number of 
alignments with E-value smaller than 10~ 3 found by each seed, and the number of exclusive alignments were 
reported. By "exclusive alignment" we mean any alignment of E-value less than 10~ 3 that does not share a 

'NC_000907.fna, NC_002662.fna, NC_003317.fna, NC_003454.fna, NC_004113.fna, NC_001263.fna, NC_0031 12.fna obtained 
from NCBI 
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w 


spaced seeds 


Sens. 


subset seeds, two @ 


Sens. 


9 


##_##_## ##_# 


0.5253 


##_@@_## ##_## 


0.5420 


10 


##_## ##_##_## 


0.4123 


##_## ##_@@_##_# 


0.4190 


11 


##_## ##_##_##_# 


0.3112 


##_## ##_@@_##_## 


0.3219 


12 


##_## ##_##_##_## 


0.2349 


##_## ##_@@_##_##_# 


0.2412 



Table 5: Best seeds and their sensitivity for probability transducer NT 



common part (do not overlap in both compared sequences) with any alignment found by another seed. To 
take into account a possible bias caused by splitting alignments into smaller ones (X-drop effect), we also 
computed the total length of exclusive alignments. Table[6]summarizes these experiments for weights 9 and 
10 and the DT2 and NT probabilistic models. Each line corresponds to a seed given in Table|4]or Tabled 
depending on the indicated probabilistic model. In all cases, best subset seeds detect from 1% to 8% more 



seed 


time 


#align 


^ex. align 


ex. align length 


DT2, w = 9, spaced seed 


15:14 


19101 


1583 


130512 


DT2, w = 9, subset seed, two © 


14:01 


20127 


1686 


141560 


DT2, w = 10, spaced seed 


8:45 


18284 


1105 


10174 


DT2, w = 10, subset seed, two @ 


8:27 


18521 


1351 


12213 


NT, w = 9, spaced seed 


42:23 


20490 


1212 


136049 


NT, w = 9, subset seed, two @ 


41:58 


21305 


1497 


150127 


NT, w = 10, spaced seed 


11:45 


19750 


942 


85208 


NT, w = 10, subset seed, two @ 


10:31 


21652 


1167 


91240 



Table 6: Comparative test of subset seeds vs spaced seeds. Reported execution times (min:sec) were ob- 
tained on a Pentium IV 2.4GHz computer. 

significant alignments compared to best spaced seeds of same weight. 

5 Discussion 

We introduced a general framework for computing the seed sensitivity for various similarity search settings. 
The approach can be seen as a generalization of methods of Q HI in that it allows to obtain algorithms 
with the same worst-case complexity bounds as those proposed in these papers, but also allows to obtain 
efficient algorithms for new formulations of the seed sensitivity problem. This versatility is achieved by 
distinguishing and treating separately the three ingredients of the seed sensitivity problem: a set of target 
alignments, an associated probability distributions, and a seed model. 

We then studied a new concept of subset seeds which represents an interesting compromise between the 
efficiency of spaced seeds and the flexibility of vector seeds. For this type of seeds, we defined an automaton 
with 0(w2 r ) states regardless of the size of the alignment alphabet, and showed that its transition table can 
be constructed in time 0(u;2 r |^4|). Projected to the case of spaced seeds, this construction gives the same 
worst-case bound as the Aho-Corasick automaton of [7 1, but results in a smaller number of states in practice. 
Different experiments we have done confirm the practical efficiency of the whole method, both at the level 
of computing sensitivity for designing good seeds, as well as using those seeds for DNA similarity search. 

As far as the future work is concerned, it would be interesting to study the design of efficient spaced 
seeds for protein sequence search (see [6|), as well as to combine spaced seeds with other techniques such 
as seed families H7ll20l[T6l or the group hit criterion [19]. 
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A Proof of Lemma HI 



Let 7r = # — r # be a spaced seed of span r + 2 and weight 2. We prove that the automaton (see 
Lemma|6j is reduced, i.e. 

(i) all its non-final states are reachable from the initial state < 0, >; 

(ii) any two non-final states q, q' are non-equivalent, i.e. there is a word w = w(q, q') such that exactly 
one of the states ip(q, w),ip(q' ,w) is a final state. 

(i) Let q =< X,t > be a state of the automaton S w , and let X = {x±, . . . , x^} and x\ < ■ ■ ■ < xp,. 
Obviously, xu + t < r + 2. Let s G {0, 1}* be an alignment word of length x\~ such that for all i G 
[l,Xfc], Si = 1 iff 3j G [1, k], i = xj~ — Xj + 1. Note, that, tt\ = #, therefore 1 ^ X and s Xk = 0. 
Finally, ip(< <p,0>,s- 1*) = q. 

(ii) Let qi =< X\,t\ > and q2 =< X2,t2 > be non-final states of S^. Let X\ = {yi, . . . , y a }, X2 = 
{zi,..., z b }, and y\ < ■ ■ ■ < y a , z\ < ■ ■ ■ < z b . 

Assume that max{Xi} + t\ > max{X2} + *2 and let d = (r + 2) — (max{X\\ + t\). Obviously, 
ip(qi, l d ) is a final state, and V'( ( ?2, l d ) is not. Now assume that max{X{\ + 1\ = max{X2] + ti- For a 
set X C {1, . . . , r + 1} and a number t, define a set X{t} by = + t\v G X and v + t < r + 2}. 

Let <? = max{v|(t> -Mi G X\ and t> + ti £ X2) or (u + £2 G X2 and v + t\ ^ X\j) and let d = r + 1 — g 
. Then ^(gi, d • 1) is a final state and ip(q2, d ■ 1) is not or vice versa. This completes the proof. 

B Subset seed automaton 

Let 7r be a subset seed of #-weight w and span s, and r = s — w be the number of non-# positions. We 
define a DFA S n recognizing all words of A* matched by ir (see definition of section ITTl . The transition 
table of S-n- is stored in an array such that each element describes a state < X, t > of S n . Now we define 

1. how to compute the array index Ind(q) of a state q =< X, t >, 

2. how to compute values ip(q, a) given a state q and a letter a G A 

B.l Encoding state indexes 

We will need some notation. Let L = {li, . . . , l r } be a set of all non-# positions in tt (l± < I2 < ■ ■ ■ < l r )- 
For a subset X C L, let = . . . v r G {0, l} r be a binary vector such that Vi = 1 iff li G X. Let 

further n(X) be the integer corresponding to the binary representation v{X) (read from left to right): 

r 

n(X)=J2V- 1 -v j . 
i=i 

Define = max{p \ l p < m — t}. Informally, for a given non-final state < X, t >, X can only be a 
subset of . . . , l p (t)}- This implies that n(X) < 2 p W. Then, the index of a given state {< X, t >} in the 
array is defined by 

Ind{< X,t >) = n(X) + 2 P ^. 

This implies that the worst-case size of the array is no more than w2 r (the proof is similar to the proof of 
Lemma |7}. 
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B.2 Computing transition function ip(q, a) 

We compute values ip(< X,t >,a) based on already computed values ip(< X',t >,a). Let q =< X, t > 
be a non-final and reachable state of S n , where X = {h,. . . ,lk} with h < h ■ ■ ■ < h an d k < r. Let 
X' = X \ {Ik} = {h, ■ ■ ■ , lk-i} and q' =< X', t >. Then the following lemma holds. 

Lemma 10. If q =< X,t > is reachable, then q' =< X' ,t > is reachable and has been processed before. 

Proof. First prove that < X',t > is reachable. If < X, t > is reachable, then < X, > is reachable due 
to the definition of transition function for t > 0. Thus, one can find at least one sequence S G A lk such 
that Vi G [L.r], Zj G X iff 7i"i • • • 7T/ i matches Sj • • • S^,. For such a sequence S, one can find a word 
S' = Si, + i ■ ■ ■ Si k which reaches state < X', >. To conclude, if there exists a word S ■ 1* that 
reaches the state < X, t >, there also exists a word S' ■ 1* that reaches < X' ,t >. 

Note that as \S' ■ 1*| < \S ■ then a breadth-first computation of states of SV always processes state 
< X',t> before < X, i >. □ 

Now we present how to compute values tp(< X,t >, a) from values ip(< X',t >,a). This is done by 
Algorithm IB .2l shown below, that we comment on now. Due to implementation choices, we represent a state 
q as triple q = (X,kx,t), where kx = max{i\li G X}. Note first that if a = 1, the transition function 
ip(q, a) can be computed in constant time due to its definition (part a. of Algorithm IB .2t . If a 7^ 1, we have 
to 

1. retrieve the index of q' given q = (X, kx,t) (part c. of Algorithm IB.2t . 

2. compute ip((X, kx, t), a ^ 1) given ip((X' , kx',t), a ^ 1) value, (part d. of Algorithm IB .2t 

1. Note first that Ind((X, kx,t)) = Ind((X', kx> , t}) — 2 kx , which can be computed in constant time 
since kx is explicitly stored in the current state. 



2. Let 



and 



V x (k,t,a^l) 



V k (k,t,a^l) 



li if li = Ik + t + 1 and a matches 7T/ i 

otherwise 

1 if Zj = Zfc + t + 1 and a matches 7T/ i 
otherwise 



Tables Vx(k, t, a) and Vk(k, t, a) can be precomputed in time and space 0(\A\-m 2 ). Let ip((X, kx,t),a) 
(Y, ky,0) and ip((X', kx>, t), a) = (Y',Zcy,0). The set Y differs from Y' at most with one element. 
This element can be computed in constant time using tables Vx, V/.. Namely Y = Y' U Vx(kx, t, a) and 
ky = max(kY', Vk(kx,t, a)). 

Note that a final situation arises when X = 0. (part b. of Algorithm IB .2i . One also has to compute two 
tables Ux, U k defined as: 

Ux(t, o^l) = U{x|x < t + 1 and a matches tt x } 
Uk(t, o^l) = max{a;|x < t + 1 and a matches vr^} 



Lemma 11. 77ie transition function ip{q,a) can be computed in constant time for every reachable state q 
and every a G A. 
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Algorithm 1: S n computation 



Data : a seed -k of span m, '^'-weight w, and number of jokers r = m — w 
Result : an automaton S n =< Q, qo, qp, A, ip > 

Q.add(q F ); 

go ^(X = 0,A; = O,t = O); 
Q.add(qo); 
queue. push{qo); 
while queue ^ do 

(X,k x ,tx) = queue.popQ; 
for a € A do 

/* compute -ip(< X, tx >,<i) = (Y, ky , ty) */ 
if a = 1 then 



se 



ty <- 




ky < 




Y <- 





if X = 0then 

Y<-U x (t x ,a); 
ky <- U k (t x ,a); 

se 

/* wre already processed ip{< X' , tx' >, a) . . . */ 
X' <— X \ 

(y',/cy/,ty/) «- X',t >,a); 
/* . ..to compute ip(< X, tx >, a) */ 
ky <- max(fcy/,l4(fc x ,tjf,a)); 

ty ^0; 

if L[fey] + ty > m then 

/* < Y,ty > is a final state */ 
ip(< X,t x >,a) <- 



se 



if (y,fcy,ty) ^ Qthen 
Q.add((y,A;y,ty}); 
queue.push((Y, ky, ty}); 

Y>(< >,a) <- (y,A;y,ty); 
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C Training probability transducers 



We selected 40 bacterial complete genomes from NCBI: NC.0001 17.fna, NC_000907.fna, NC_000909.fna, NC_000922.fna, 
NC_000962.fna, NC.001263.fna, NC.001318.fna, NC_002162.fna, NC.002488.fna, NC_002505.fna, NC_002516.fna, NC.002662.fna, 
NC_002678.fna, NC.002696.fna, NC.002737.fna, NC_002927.fna, NC.003037.fna, NC_003062.fna, NC.0031 12.fna, NC.003210.fna, 
NC_003295.fna, NC.003317.fna, NC.003454.fna, NC_003551.fna, NC.003869.fna, NC_003995.fna, NC_0041 13.fna, NC.004307.fna, 
NC_004342.fna, NC .00455 l.fna, NC .00463 l.fna, NC_004668.fna, NC.004757.fna, NC_005027.fna, NC_005061.fna, NC_005085.fna, 
NC_005125.fna, NC_005213.fna, NC.005303.fna, NC.005363.fna . 

YASS [19 1 has been run on each pair of genomes to detect alignments with E-value at most 10~ 3 . 
Resulting ungapped regions of length 64 or more have been used to train models DTI, DT2 and NT by the 
maximal likelihood criterion. TableQgives the p function of the probability transducer DTI, that specifies 
the probabilities of match (1), transition (h) and transversion (0) at each codon position. 



a : 





h 


1 


p(<?0,a 


n) 


0.2398 


0.2945 


0.4657 


p(qi,a 


92) 


0.1351 


0.1526 


0.7123 


p(Q2,a 


go) 


0.1362 


0.1489 


0.7150 



Table 7: Parameters of the DTI model 

Table [8] specifies the probability of each codon instance 010203 £ A 3 , used to define the probability 
transducer DT2. 



ai\a 2 a3 : 


00 


Oh 


01 


hO 


hh 


hi 


10 


lh 


11 



h 
1 


0.01089 
0.01022 
0.02083 


0.01329 
0.00984 
0.02158 


0.01311 
0.01093 
0.02554 


0.01107 
0.00956 
0.02537 


0.00924 
0.01025 
0.02604 


0.01144 
0.01294 
0.03776 


0.01887 
0.02155 
0.11298 


0.01946 
0.02552 
0.16165 


0.03106 
0.03983 
0.27915 



Table 8: Probability of each codon instance specified by the DT2 model 

Finally, Table |9] specifies the probability transducer NT by specifying the four DT2 models together 
with transition probabilities between the initial states of each of these models. 









Pr( qi 


-9.) 


3 = 


1 2 


3 














i = 


= 


0.9053 


0.0947 



















1 


0.1799 


0.6963 0.1238 
















2 





0.2131 0.6959 0.0910 














3 


0.0699 


0.0413 0.1287 0.7601 






ai\a2<i3 : 


00 


Oh 


01 


hO 


hh 


hi 


10 


lh 


11 







0.01577 


0.01742 


0.01440 


0.01511 


0.01215 


0.01135 


0.02502 


0.02353 


0.02786 


90 


h 


0.01478 


0.01365 


0.01266 


0.01348 


0.01324 


0.01346 


0.02815 


0.02981 


0.03442 




1 


0.02701 


0.02838 


0.02600 


0.03429 


0.03158 


0.03406 


0.12973 


0.17461 


0.17809 







0.00962 


0.01241 


0.01501 


0.00891 


0.00753 


0.01247 


0.01791 


0.01841 


0.03530 


91 


h 


0.00818 


0.00766 


0.01115 


0.00738 


0.00952 


0.01353 


0.01828 


0.02978 


0.04405 




1 


0.01946 


0.01682 


0.02344 


0.02456 


0.02668 


0.03890 


0.12113 


0.18170 


0.26020 







0.00406 


0.00692 


0.00954 


0.00501 


0.00372 


0.00841 


0.01034 


0.01129 


0.03430 


92 


h 


0.00391 


0.00396 


0.00758 


0.00364 


0.00707 


0.01473 


0.01288 


0.01975 


0.05058 




1 


0.01250 


0.01627 


0.02416 


0.01419 


0.02071 


0.04427 


0.10014 


0.15311 


0.39698 







0.00302 


0.00267 


0.00560 


0.00289 


0.00249 


0.00807 


0.00740 


0.00710 


0.03195 


93 


h 


0.00297 


0.00261 


0.00355 


0.00299 


0.00271 


0.00935 


0.00924 


0.01148 


0.04296 




1 


0.01035 


0.01125 


0.02204 


0.00930 


0.01289 


0.04235 


0.05304 


0.08163 


0.59810 



Table 9: Probabilities specified by the NT model 
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